Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data

Background Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.
Results Here, we report ‘Prop3D’, a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a ‘Prop3D-20sf’ protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
Conclusion Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, in creating datasets and data splits, the construction of Prop3D-20sf explicitly accounts for the problem of ‘data leakage’ that stems from the evolutionary relationships between proteins.
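To illustrate the kind of leakage-aware splitting mentioned above, the following is a minimal sketch of a group-aware split: every domain that shares a group label (for instance, a sequence-identity cluster) is placed in the same partition, so that close evolutionary relatives never straddle training and testing. This is a generic technique and not necessarily the procedure used to build Prop3D-20sf; the function name, the `salt` parameter and the cluster labels are hypothetical, introduced only for illustration.

```python
import hashlib


def split_by_group(domain_to_group, test_fraction=0.2, salt="example-salt"):
    """Assign whole groups (never single domains) to 'train' or 'test'.

    domain_to_group maps a domain identifier to a group label (hypothetical
    here, e.g. a sequence-identity cluster). Hashing the label, rather than
    drawing random numbers, makes the split reproducible across runs.
    """
    splits = {"train": [], "test": []}
    for domain, group in sorted(domain_to_group.items()):
        digest = hashlib.sha256(f"{salt}:{group}".encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        u = int.from_bytes(digest[:8], "big") / 2**64
        splits["test" if u < test_fraction else "train"].append(domain)
    return splits
```

Note that `test_fraction` applies to groups rather than to individual domains, so the realized proportion of test domains is only approximate, particularly when group sizes are uneven.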

3 How Prop3D abides by the FAIR guidelines
In general, the creation of scalable, reproducible scientific workflows faces challenges that stem from the sheer volume and heterogeneity of available data-sources and data-types, as well as from other factors, such as the range of computing platforms, architectures and capabilities across which one may seek to deploy a workflow (e.g., multi-core processing on a local workstation, versus a Linux HPC cluster, versus an eScience grid or other highly distributed network environment in the cloud). The 'FAIR' principles for scientific data provide a set of best practices that contribute to the research enterprise by striving to make datasets Findable, Accessible, Interoperable, and Reusable. In other words, FAIR datasets should be (i) easy to find, with appropriate metadata to facilitate searching by others; (ii) easy to access in full, without undue effort; (iii) readily integrated and interoperable with other data-sources and software frameworks; and (iv) (re)usable and replicable by others (a bedrock of the scientific method). When possible, these guidelines (https://www.go-fair.org/fair-principles) would apply equally well to the datasets themselves and to the code that underlies the data-generating and data-processing/analysis/reduction pipelines; i.e., the software framework would be FAIR-compliant insofar as its resultant data are FAIR. The following enumerates how Prop3D complies with these guidelines:

1. Findable
The first step in (re)using data is to be able to find it. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automated discovery of datasets and services, so this is an essential component of the FAIRification process. In Prop3D, the following hold:
F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource

2. Accessible
Once a user finds the required data, they need to know how such data can be accessed, possibly including issues of authentication and authorisation. In Prop3D, the following hold:
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A1.1. The protocol is open, free, and universally implementable
A1.2. The protocol allows for an authentication and authorisation procedure, where necessary
A2. Metadata are accessible, even when the underlying data are no longer available

3. Interoperable
Datasets generally are not used in a vacuum, and at some point or another will need to be integrated with other types and sources of data. In addition, the data need to interoperate with available applications or workflows for analysis, storage, and processing. In Prop3D, the following hold:
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data

4. Reusable
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. In Prop3D, the following hold:
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (Meta)data are released with a clear and accessible data usage license
R1.2. (Meta)data are associated with detailed provenance
R1.3. (Meta)data meet domain-relevant community standards
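As a rough illustration of what metadata retrievable by identifier over a standard protocol can look like in practice, the sketch below walks an HDF5-style hierarchy and gathers the attributes attached to each group and dataset. It assumes only the common interface shared by `h5py` and `h5pyd` objects (`.attrs` on every node and `.items()` on groups). The helper names are hypothetical, and the layout and attribute names of the actual Prop3D datasets are not specified here, so the domain path and endpoint URL should be taken from the Prop3D documentation.

```python
def collect_metadata(node, path="/", max_depth=None):
    """Return {path: attributes} for an HDF5-style hierarchy.

    `node` is any object with `.attrs` and, for groups, `.items()`.
    `max_depth` limits how many levels below `node` are visited
    (None means no limit), which is useful for large remote files.
    """
    found = {path: dict(node.attrs)}
    descend = max_depth is None or max_depth > 0
    if descend and hasattr(node, "items"):  # groups have members; datasets do not
        next_depth = None if max_depth is None else max_depth - 1
        for name, child in node.items():
            child_path = path.rstrip("/") + "/" + name
            found.update(collect_metadata(child, child_path, next_depth))
    return found


def open_remote(domain, endpoint):
    """Open an HSDS domain read-only (needs the h5pyd package and network access)."""
    import h5pyd  # imported lazily so collect_metadata has no dependencies

    return h5pyd.File(domain, "r", endpoint=endpoint)


# Usage sketch; the placeholders must be replaced with real values:
# with open_remote("<domain path>", "<HSDS endpoint URL>") as f:
#     print(collect_metadata(f, max_depth=1))
```

Limiting the depth keeps the number of HTTP requests small when only top-level metadata, such as a license or provenance record, is of interest.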

Table 1: Sequence-based bioinformatics tools available in Prop3D. Most of these tools have been dockerized and are available at our Docker Hub (https://hub.docker.com/u/edraizen).

Table 2: Structural bioinformatics software suites available in Prop3D. Most of these tools have been dockerized and are available at our Docker Hub (https://hub.docker.com/u/edraizen).